Programmable pipeline fabric having mechanism to terminate signal propagation

ABSTRACT

A method and apparatus for storing and using “register use” information to determine when a register is being used for the last time so that power savings may be achieved is disclosed. The register use information may take the form of “last read” information for a particular register. The last read information may be used to force the value of the register, after being read, to zero or to clock only that register while masking off the other registers. Several methods and hardware variations are disclosed for using the register use information to achieve power savings.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of copending U.S. application Ser. No. 10/222,608 filed 16 Aug. 2002 and entitled Programmable Pipeline Fabric Having Mechanism to Terminate Signal Propagation.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was developed in part through funding provided by DARPA-ITO/TTO under contract No. DABT63-96-C-0083. The federal government may have rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is related to reconfigurable architectures and, more particularly, to reconfigurable architectures used to process information in a pipelined fashion.

2. Description of the Background

Traditional approaches to reconfigurable computing statically configure programmable hardware to perform a user-defined application. The static nature of such a configuration causes two significant problems: a computation may require more hardware than is available, and a single hardware design cannot exploit the additional resources that will inevitably become available in future process generations. A technique called pipelined reconfiguration implements a large logical configuration on a small piece of hardware through rapid reconfiguration of that hardware. With this technique, the compiler is no longer responsible for satisfying fixed hardware constraints. In addition, a design's performance improves in proportion to the amount of hardware allocated to that design.

Pipelined configuration involves virtualizing pipelined computations by breaking a single static configuration into pieces that correspond to pipeline stages in the application. Each pipeline stage is loaded, one per cycle, into the fabric. This makes performing the computation possible, even if the entire configuration is never present in the fabric at one time.

FIG. 1 illustrates the virtualization process, showing a five-stage pipeline virtualized on a three-stage fabric. FIG. 1A shows the five-stage application and each logical (or virtual) pipeline stage's state in six consecutive cycles. FIG. 1B shows the state of the physical stages in the fabric as it executes this application. In this example, virtual pipe stage 1 is configured in cycle 1 and ready to execute in the next cycle; it executes for two cycles. There is no physical pipe stage 4; therefore, in cycle 4, the fourth virtual pipe stage is configured in physical pipe stage 1, replacing the first virtual stage. Once the pipeline is full, each five-cycle period generates two results, one in each of two consecutive cycles. For example, cycles 2, 3, 7, 8, . . . consume inputs and cycles 6, 7, 11, 12, . . . generate outputs.
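The mapping of virtual stages onto physical stages can be sketched as follows. This is a minimal illustration only, assuming that exactly one virtual stage is configured per cycle and that physical stages are reused in round-robin order; the text confirms only the placements for cycles 1 and 4, and the function name configured_in_cycle is hypothetical.

```python
# Sketch of the configuration schedule for a five-stage virtual pipeline
# on a three-stage physical fabric. Round-robin reuse of physical stages
# is an assumption made for illustration.
VIRTUAL_STAGES = 5
PHYSICAL_STAGES = 3


def configured_in_cycle(cycle):
    """Return (virtual_stage, physical_stage) configured in a 1-based cycle."""
    virtual = (cycle - 1) % VIRTUAL_STAGES + 1
    physical = (cycle - 1) % PHYSICAL_STAGES + 1
    return virtual, physical


# Stages configured during the first six cycles.
schedule = [configured_in_cycle(c) for c in range(1, 7)]
```

Under these assumptions, cycle 4 places virtual stage 4 in physical stage 1, which matches the replacement of the first virtual stage described above.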

FIG. 2 is an abstract view of the architectural class of a pipelined fabric. Each row of processing elements (PEs) together with its associated interconnections is referred to as a stripe. Each PE typically contains an arithmetic logic unit (ALU) and a pass register file. Each ALU contains lookup tables (LUTs) and extra circuitry for carry chains, zero detection, and so on. Designers implement combinational logic using a set of N B-bit-wide ALUs. The ALU operation is static while a particular virtual stripe resides in a physical stripe. Designers can cascade, chain or otherwise connect the carry lines of the ALUs to construct wider ALUs, and chain PEs together via an interconnection network to build complex combinational functions.

One of the key enabling structures for pipeline reconfiguration is the pass register file. An example pass register file 10 is shown in FIG. 3. Pass register file 10 is comprised of four registers 12, 14, 16, 18 (which may have an arbitrary bitwidth); a write port consisting of, in this figure, four multiplexers 20, 22, 24, 26 and a write address decoder 28; and a read port, consisting of, in this figure, a 4-to-1 multiplexer 30 responsive to a read address. The structure of FIG. 3 allows a functional unit connected to this register file 10 to read one value from the register file 10 and also allows a functional unit to write one value into one of the specific registers 12, 14, 16, 18. If a value is not written into one of the registers 12, 14, 16, 18 by the write port, then the value from the corresponding pass register in the previous pass register file in the previous stripe is written into registers 12, 14, 16, 18 via lines 32, 34, 36, 38, respectively.
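The pass behavior of such a register file can be sketched as a simple behavioral model. This is an illustration under stated assumptions, not the circuit of FIG. 3: registers are modeled as a Python list, and the names pass_register_update and read_port are hypothetical.

```python
# Behavioral sketch of one pass register file with a single write port
# and a single read port. All names here are illustrative assumptions.

def pass_register_update(prev_regs, write_addr=None, write_value=None):
    """Compute the next register contents for one stripe.

    The register selected by write_addr takes write_value; every other
    register takes the value of the corresponding register in the
    previous stripe's pass register file.
    """
    return [
        write_value if index == write_addr else prev_value
        for index, prev_value in enumerate(prev_regs)
    ]


def read_port(regs, read_addr):
    """Model the 4-to-1 read multiplexer: select one register by address."""
    return regs[read_addr]


# Register 2 is written; registers 0, 1 and 3 pass through unchanged.
next_regs = pass_register_update([10, 20, 30, 40], write_addr=2, write_value=99)
```

When no write address is supplied, every register simply passes the previous stripe's value, which is the propagation that the present invention seeks to terminate once a value is no longer needed.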

FIG. 4 illustrates how four pass register files 42, 44, 46, 48 might be used in an application. In this figure, the pass register files 42, 44, 46, 48 are connected in a ring, but need not be so connected. In FIG. 4, only one register is shown in each of the register files 42, 44, 46, 48 although each of the register files could be arbitrarily large. In FIG. 4, data generated by Functional Unit 1 proceeds to Functional Unit 2 through one pass register file 44.

A chief problem with the structure of FIG. 4 is that the value, which is only meant for use by Functional Unit 2, continues through the other pass register files 46, 48, 42, in subsequent stripes. If the value is not overwritten by other stripes using this register, such values continue to propagate all the way back to Functional Unit 1. This activity is worthless for the computation, and dissipates significant power.

A related power consumption problem that occurs in pass register files in pipeline reconfigurable devices is that old values from previous applications that were in the chip continue to propagate through the chip, consuming power even though they are irrelevant to the current computation. Thus, a need exists for a mechanism in the pipeline fabric for terminating signals that are no longer needed for the computation.

SUMMARY OF THE PRESENT INVENTION

The present invention is directed to a method and apparatus for storing and using “register use” information to determine when a register is being used for the last time so that power savings may be achieved. The register use information may take the form of “last read” information for a particular register. The last read information may be used to force the value of the register, after being read, to a constant or to clock only that register while masking off the other registers. Several methods and hardware variations are disclosed for using the “register use” information to achieve power savings. Those advantages and benefits, and others, will be apparent from the Detailed Description of the Invention herein below.

BRIEF DESCRIPTION OF THE DRAWINGS

For the present invention to be easily understood and readily practiced, the present invention will now be described, for purposes of illustration and not limitation, in conjunction with the following figures, wherein:

FIGS. 1A and 1B illustrate the process of virtualizing a five-stage pipeline on a three-stage reconfigurable fabric;

FIG. 2 illustrates a stripe of a reconfigurable fabric;

FIG. 3 is an example of a pass register file;

FIG. 4 illustrates four pass register files, each having a single register, to demonstrate unwanted signal propagation;

FIG. 5 illustrates one embodiment of the present invention for terminating unwanted signal propagation by forcing the value of the signal to zero;

FIG. 6 illustrates another embodiment of the present invention for terminating unwanted signal propagation by clocking only the registers needed to produce the value to be read;

FIG. 7 illustrates another embodiment of the present invention for terminating unwanted signal propagation by clocking only the registers needed to produce the value to be read;

FIG. 8 is a diagram illustrating an embodiment for a mask unit;

FIG. 9 illustrates a modification to the circuit of FIG. 6 so as to use local mask units;

FIG. 10 illustrates a circuit in which registers are clocked by a common clock signal and four AND gates and a decoder are used to force one register to a value of zero; and

FIG. 11 illustrates a modification to the circuit of FIG. 10 to enable each register to be clocked by its own clock signal.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 5 illustrates one embodiment of the present invention for terminating unwanted signal propagation. In FIG. 5, as is known, each physical stripe is configured with a virtual stripe by, for example, writing a configuration word to the physical stripe. A detailed explanation of configuration management and data management is provided in Schmit, et al, “Managing Pipeline-Reconfigurable FPGAs” published in ACM 6th International Symposium on FPGAs, February 1998, the entirety of which is hereby incorporated by reference. The reader desiring more details on the task of writing a configuration word to a physical stripe is referred to the above-identified article. Additional details regarding the construction and operation of reconfigurable fabrics may be found in Schmit, et al, “PipeRench: a virtualized programmable datapath in 0.18 Micron Technology”, in Proceedings of the IEEE Custom Integrated Circuits Conference (CICC), 2002, the entirety of which is hereby incorporated by reference; Schmit, “PipeRench: a reconfigurable, architectural and compiler”, IEEE Computer, pages 70-76 (April 2000), the entirety of which is hereby incorporated by reference; Schmit, “Incremental Reconfiguration for Pipelined Applications”, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pp. 47-55, 1997, the entirety of which is hereby incorporated by reference; and Schmit et al, “PipeRench: A Coprocessor for Streaming Multimedia Acceleration”, International Symposium on Computer Architecture, pp. 38-49, 1999, the entirety of which is hereby incorporated by reference.

One aspect of the present invention is to include some additional information in the encoding of a stripe (e.g. in the configuration word) that indicates whether a read from the register file is the last read of that data value in the application. The “last read” information can be generated by the compiler or physical design tool that generates the virtual stripe information, or it can be generated by a separate program that analyzes a set of virtual stripes to determine when the last read occurs. The first and last stripes in an application present special cases. In the last stripe in a virtual application, there are no subsequent stripes. Therefore, there are no further reads of values in the register file. In the first virtual stripe, none of the values currently in the register files in physical stripes that are located before the first virtual stripe are going to be used. For stripes other than the first and last stripes in an application, the information about the last time a value in a register needs to be read (sometimes referred to as the last read information) can be used in a number of ways to reduce power consumption.
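One way a separate program might derive the last read information from a set of virtual stripes is a backward scan, sketched below. This is an assumption-laden illustration rather than the disclosed tool: the stripe representation (a dict with 'reads' and 'write' keys) and the function name annotate_last_reads are hypothetical, and it is assumed that reads in a stripe see the register values produced by earlier stripes, before that stripe's own write takes effect.

```python
# Sketch of a backward scan that marks which reads are "last reads".
# Data layout and names are illustrative assumptions.

def annotate_last_reads(stripes):
    """Return, for each stripe, one last-read flag per read port.

    stripes is a list of dicts of the form
        {"reads": [register, ...], "write": register or None}
    ordered from the first virtual stripe to the last.
    """
    live = set()          # registers whose current value is read later
    flags = [None] * len(stripes)
    for index in range(len(stripes) - 1, -1, -1):
        stripe = stripes[index]
        # A write replaces the old value, so the old value is dead
        # after this stripe's reads.
        if stripe["write"] is not None:
            live.discard(stripe["write"])
        # A read is the last read if no later stripe reads that value.
        flags[index] = [reg not in live for reg in stripe["reads"]]
        live.update(stripe["reads"])
    return flags


example = [
    {"reads": [], "write": 0},
    {"reads": [0], "write": 1},
    {"reads": [0, 1], "write": None},
    {"reads": [1], "write": None},
]
last_read_flags = annotate_last_reads(example)
```

In this example the value in register 0 is read for the last time in the third stripe, and the value in register 1 is read for the last time in the fourth stripe.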

FIG. 5 illustrates one embodiment for using the last read information to reduce power consumption by masking the value after a final read. In FIG. 5, there are four register files 42, 44, 46, 48 each having one register 42′, 44′, 46′, 48′, respectively, for purposes of simplicity. The reader will understand that in practice each register file will have a plurality of registers as shown, for example, in FIG. 3. In addition, the reader will understand that each register could store more than one bit. In the actual PipeRench implementation described in the previous publications, each register in each register file stores eight bits. In the embodiment of FIG. 5, the last read information is used to fix the value in subsequent stripes in the fabric to a constant value. In the embodiment of FIG. 5 that is accomplished with an AND gate 52 located prior to (or in) register file 42, an AND gate 54 located prior to (or in) register file 44, an AND gate 56 located prior to (or in) register file 46, and an AND gate 58 located prior to (or in) register file 48. Assuming that the read from register 44′ is the last time that value needs to be read, inputting a zero on one of the input terminals of the AND gate 56 forces the value at the output terminal of the AND gate 56, and in the subsequent pass register files, to zero. The value input to the input terminals of the other AND gates 52, 54, and 58 is not of significance in terminating the propagation of the signal produced by the register 44′. Other gates that can be used in place of the AND gates include OR gates and NAND gates. Any type of gate that exhibits a monotonic function, i.e. a gate that “forces” the output based on a controlling value at one of the inputs, can be used.

It will be noticed that the value output by register 44′ is terminated, i.e. prevented from propagating, by AND gate 56 by forcing that value to zero. In a register, clocking in a constant value consumes less power than clocking in a changing value. Thus, forcing the value to zero results in power savings. A similar result can be achieved by masking the multiplexor read bit for the appropriate multiplexor responsive to the last read register so that the value output by the register is no longer read when no longer needed.
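The effect of the AND gates on a value circulating through the ring of FIG. 5 can be sketched with a simple behavioral model. This is an illustration under stated assumptions: the four single-register files are modeled as a list ordered 42′, 44′, 46′, 48′, functional unit writes are omitted, registers are eight bits wide, and the names and_gate and advance are hypothetical.

```python
# Behavioral sketch of data gating in a ring of four pass registers.
# List order is [42', 44', 46', 48']; names are illustrative assumptions.
WIDTH_MASK = 0xFF  # eight-bit registers


def and_gate(value, enable):
    """Pass the value when enable is 1; force it to zero when enable is 0."""
    return value & (WIDTH_MASK if enable else 0)


def advance(regs, enables):
    """One clock: each register loads the gated output of its predecessor."""
    count = len(regs)
    return [and_gate(regs[(i - 1) % count], enables[i]) for i in range(count)]


start = [0x00, 0xA5, 0x00, 0x00]              # value of interest sits in 44'
ungated = advance(start, [1, 1, 1, 1])        # value moves on to 46'
gated = advance(start, [1, 1, 0, 1])          # gate 56 forces the value to zero
```

With the enable for gate 56 set to zero, register 46′ and every later register in the ring load a constant zero, so no further toggling is caused by the value that was read for the last time.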

In FIG. 6 another method of using the last read information to stop a signal from propagating and for saving power is illustrated. The circuit of FIG. 6 is similar to the circuit of FIG. 5 except that the AND gates 52, 54, 56, 58 are positioned to receive a clock signal 60. The clock signal output by AND gates 52, 54, 56, 58 is input to registers 42′, 44′, 46′ and 48′, respectively. Another way the last read information can be used to reduce power in a register is to stop the register from clocking. In FIG. 6, that is performed by masking (blocking) the clock signal 60 to those registers 42′, 46′, 48′ that are unused by inputting a zero to one of the input terminals of AND gates 52, 56, 58, respectively. Only the one register 44′ in use is actually clocked by inputting a one to one of the input terminals of the AND gate 54, which saves significant clock distribution power, as well as the power dissipated in the register itself. The set of values input to AND gates 52, 54, 56, 58 (e.g. 0100) may be referred to as a clocking mask.
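The clocking mask can be sketched in the same behavioral style. This is a simplified illustration, assuming each register captures its data input only when its gated clock is high; the names gate_clocks and clock_registers are hypothetical.

```python
# Behavioral sketch of clock gating with a clocking mask such as 0100.
# List order is [42', 44', 46', 48']; names are illustrative assumptions.

def gate_clocks(clock, mask):
    """AND the common clock with each mask bit to form per-register clocks."""
    return [clock & bit for bit in mask]


def clock_registers(regs, data_in, clock, mask):
    """Registers with a high gated clock load data_in; the rest hold."""
    gated = gate_clocks(clock, mask)
    return [
        new if enabled else old
        for old, new, enabled in zip(regs, data_in, gated)
    ]


mask = [0, 1, 0, 0]  # clocking mask 0100: only register 44' is clocked
after = clock_registers([1, 2, 3, 4], [9, 9, 9, 9], clock=1, mask=mask)
```

Only the second register changes state; the other three receive no clock edge and hold their contents.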

FIG. 7 illustrates a somewhat more complex embodiment of the circuit shown in FIG. 6 in that instead of providing a plurality of gates and a clocking mask to the gates, information is provided to a plurality of mask units 62, 64, 66, 68 which locally determine if registers within register files 42, 44, 46, 48, respectively, should be clocked. The design of FIG. 7 requires the additional circuitry of the mask units 62, 64, 66, 68 and two AND gates per mask unit to compute the value of the clock mask variable for each stripe (register file). The clock mask bit is determined based on what happened “most recently” in each register within each register file. What happened most recently is determined from the inputs “ReadAdd0”, “ReadAdd1”, “WriteAdd”, “LastRead0”, “LastRead1”, and “LastVirtual”, as well as information on the state of the previous mask unit. If that register has been “read for the last time”, then the clock is masked off. If the register has been written more recently than it has been “read for the last time”, the clock is enabled. That can be implemented with a small finite state machine receiving the inputs identified above.

In this state machine, shown in FIG. 8, a register in the register file would be clocked if that register is not in the last virtual stripe and was either written in this stripe (as indicated by the write address) or was clocked in the previous stripe and was not the last read (as indicated by the read address and the last read bit corresponding to that port).
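The clocking rule stated above can be sketched as a per-register Boolean function. This is a minimal behavioral model, not the circuit of FIG. 8: the function name mask_unit, the use of None to mean no write, and the list-of-booleans representation of the previous mask unit's state are assumptions made for illustration.

```python
# Behavioral sketch of a mask unit for a register file with two read
# ports and one write port. Names and data layout are assumptions.

def mask_unit(prev_mask, read_add0, read_add1, write_add,
              last_read0, last_read1, last_virtual, num_regs=4):
    """Return one clock-enable flag per register for this stripe.

    prev_mask holds the clock-enable flags of the previous mask unit.
    write_add is None when nothing is written in this stripe.
    """
    mask = []
    for reg in range(num_regs):
        written = write_add == reg
        last_read = ((read_add0 == reg and last_read0) or
                     (read_add1 == reg and last_read1))
        # Clock if not in the last virtual stripe and either written here,
        # or clocked previously and not read here for the last time.
        clocked = (not last_virtual) and (
            written or (prev_mask[reg] and not last_read))
        mask.append(bool(clocked))
    return mask


current = mask_unit([True, False, True, False],
                    read_add0=0, read_add1=2, write_add=3,
                    last_read0=True, last_read1=False, last_virtual=False)
```

In this example register 0 stops being clocked because it was read for the last time, register 2 stays clocked because its read was not a last read, and register 3 becomes clocked because it was written in this stripe.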

FIG. 9 illustrates the circuit of FIG. 6 modified to provide local mask units.

The previous embodiments use exactly the same information, whether a value in a register is being read for the last time, to determine that the value should not be allowed to propagate, either by forcing the value to a constant (e.g. zero) or not clocking the registers, to reduce power. When the pass register file includes more than one register, the read port address (which specifies which register is being accessed) and the bit indicating “last read” can be combined to determine which value is being read for the last time in the application. There are other ways to encode this information which, at present, seem less efficient. For example, it is possible to have an explicit “in-use” bit for each register in each register file such that it would not be necessary to combine the information with the read port address. Thus, the present invention is directed to using any “register use” information for power savings.

Furthermore, the information that a stripe is either the first or last virtual stripe can also be used by the mask unit to save power. At the first virtual stripe, the application knows that any data coming from previous stripes is not meaningful for this application. This bogus data could be the results from a prior computation that was executed on the stripes in the fabric. As a result, a mask unit that is informed that a stripe is the first virtual stripe could mask the clock or gate the data for any data arriving from a physical stripe prior to the physical stripe containing the first virtual stripe.

FIG. 10 shows a complex register file with four registers, two read ports, one write port, and a set of four gates that can make the output values from a register that has been read for the last time constant. FIG. 11 shows a register file with the same parameters as FIG. 10, but with separate clocks that would be generated by a mask unit. The register file in FIG. 11, if it were reduced to containing two registers, could be used in FIG. 7 to replace 44.

Finally, to address the special cases of the first and last virtual stripe, a register file should have unused register file entries masked (e.g. see FIG. 10) or have their clocks gated by, for example, providing separate clock signals for each register (see FIG. 11).

While the present invention has been described in connection with preferred embodiments thereof, those of ordinary skill in the art will recognize that many modifications and variations are possible. The present invention is intended to be limited only by the following claims and not by the foregoing description.

1.-16. (canceled)
17. A power saving method, comprising: providing configuration information to each of a plurality of series connected pass register files, each pass register file comprised of a plurality of registers; providing clock pulses to each of said pass register files; determining for each pass register file, if the registers within said pass register file should be clocked with said clock pulses based on a read address, a write address, and a last read data for said pass register file; and selectively applying said clock pulses to the registers within each of said pass register files based on said determining.
18. The method of claim 17 wherein said determining is performed one of remotely or locally with respect to each of said pass register files.
19. The method of claim 17 wherein said determining is additionally based on a state of a preceding pass register file in said plurality of series connected pass register files.
20. The method of claim 19 wherein said determining is performed by a state machine.
21. The method of claim 17 wherein said determining is performed by a plurality of mask units each positioned locally with respect to one of said pass register files, and wherein said selectively applying is performed by a plurality of logic gates, each responsive to one of said plurality of mask units and each receiving said clock pulses.
22. A power saving circuit for use in a reconfigurable apparatus of the type constructed of a plurality of serially connected pass register files, each pass register file constructed of a plurality of registers, said power saving circuit comprising: a plurality of mask units, each producing a signal for controlling the application of clock pulses to one of said pass register files based on a read address, a write address, and a last read data for said pass register file; and a plurality of logic gates, each responsive to one of said mask units for selectively applying clock pulses to the registers within one of said pass register files.
23. The circuit of claim 22 wherein each of said mask units is located one of remotely or locally with respect to each of said pass register files.
24. The circuit of claim 22 wherein each of said mask units is additionally responsive to a state of a mask unit for a preceding pass register file in said plurality of series connected pass register files.
25. The circuit of claim 22 wherein each of said mask units includes a state machine.
26. The circuit of claim 22 wherein said plurality of logic gates includes a plurality of AND gates.
27. A reconfigurable apparatus, comprising: a plurality of series connected pass register files each comprised of a plurality of registers, each of said pass register files adapted to receive configuration information; a plurality of mask units, each producing a signal for controlling the application of clock pulses to one of said pass register files based on a read address, a write address, and a last read data for said pass register file; and a plurality of logic gates, each responsive to one of said mask units for selectively applying clock pulses to the registers within one of said pass register files.
28. The apparatus of claim 27 wherein each of said mask units is located one of remotely or locally with respect to each of said pass register files.
29. The apparatus of claim 27 wherein each of said mask units is additionally responsive to a state of a mask unit for a preceding pass register file in said plurality of series connected pass register files.
30. The apparatus of claim 27 wherein each of said mask units includes a state machine.
31. The apparatus of claim 27 wherein said plurality of logic gates includes a plurality of AND gates.